Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

Ma, Zhengchi, Zhang, Anru R.

arXiv.org Machine Learning

Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic examples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples and evaluated under the balanced population risk. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error near the balanced optimum, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help (a "local asymmetry" regime), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing, sometimes by a small refinement and sometimes substantially when generator bias is systematic. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures when the data indicate them. Simulations and a real sepsis prediction study support the theory and illustrate when synthetic augmentation helps, when it cannot, and how to tune its quantity effectively.
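The VTSS recommendation amounts to a one-dimensional search over the synthetic size m, scored by a balanced validation loss. A minimal sketch of that loop, assuming a logistic-regression classifier, balanced log-loss as the criterion, and a generator(m) callable that returns m synthetic minority feature vectors (all illustrative stand-ins, not the paper's implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def balanced_log_loss(y_true, p_pred):
    """Average of per-class log losses, i.e. loss under the balanced population."""
    losses = []
    for c in np.unique(y_true):
        mask = y_true == c
        losses.append(log_loss(y_true[mask], p_pred[mask], labels=[0, 1]))
    return float(np.mean(losses))

def vtss(X_train, y_train, X_val, y_val, generator, grid):
    """Pick the synthetic minority size m minimizing balanced validation loss.
    `generator(m)` is an assumed interface returning m synthetic minority rows."""
    best_m, best_loss = 0, np.inf
    for m in grid:
        if m > 0:
            X_aug = np.vstack([X_train, generator(m)])
            y_aug = np.concatenate([y_train, np.ones(m, dtype=int)])
        else:
            X_aug, y_aug = X_train, y_train
        clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
        loss = balanced_log_loss(y_val, clf.predict_proba(X_val))
        if loss < best_loss:
            best_m, best_loss = m, loss
    return best_m
```

A grid centered near full balancing, e.g. range(0, 2 * (n_maj - n_min) + 1, step), matches the paper's recommendation to search around the fully balanced baseline while allowing meaningful departures.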


Bias-Corrected Data Synthesis for Imbalanced Learning

Lyu, Pengfei, Ma, Zhengchi, Zhang, Linjun, Zhang, Anru R.

arXiv.org Machine Learning

Imbalanced data, where positive samples represent only a small proportion relative to negative samples, makes it challenging for classifiers to balance false positive and false negative rates. A common approach to addressing this challenge is to generate synthetic data for the minority group and then train classification models on both observed and synthetic data. However, since the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy is reduced when the synthetic data is naively treated as true data. In this paper, we address the bias introduced by synthetic data and provide consistent estimators for this bias by borrowing information from the majority group. We propose a bias correction procedure that mitigates the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. The procedure extends to broader imbalanced-data scenarios, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on bias estimation errors and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.
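The abstract does not spell out the estimator, but the "borrowing from the majority group" idea can be illustrated on a simple mean functional: replay the same fit-generator-then-augment pipeline on a majority subsample of minority size, where the ground truth is observable, and subtract the resulting gap. A loose, hypothetical sketch under those assumptions (the paper's actual consistent estimators and guarantees differ):

```python
import numpy as np

def bias_corrected_mean(X_min, X_syn_min, X_maj, fit_generator, rng=None):
    """Estimate the minority mean from real + synthetic samples, then subtract
    a bias estimate learned on the majority group, where the truth is known.
    `fit_generator(X)` is an assumed interface: fit on X, return a sampler(n)."""
    rng = np.random.default_rng(rng)
    # Naive synthetic-augmented estimate on the minority side.
    naive = np.vstack([X_min, X_syn_min]).mean(axis=0)
    # Replay the same pipeline on the majority group: hold out a subsample of
    # minority size, fit the generator on it, and synthesize the same amount.
    sub = X_maj[rng.choice(len(X_maj), size=len(X_min), replace=False)]
    syn_maj = fit_generator(sub)(len(X_syn_min))
    # The gap between the augmented estimate and the full-sample majority mean
    # is a plug-in estimate of the generator-induced bias.
    bias = np.vstack([sub, syn_maj]).mean(axis=0) - X_maj.mean(axis=0)
    return naive - bias
```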


Large Language Models for Imbalanced Classification: Diversity makes the difference

Nguyen, Dang, Gupta, Sunil, Do, Kien, Nguyen, Thin, Braund, Taylor, Whitton, Alexis, Venkatesh, Svetha

arXiv.org Machine Learning

Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting categorical variables into numerical vectors, which often leads to information loss. Recently, large language model (LLM)-based methods have been introduced to overcome this limitation. However, current LLM-based approaches typically generate minority samples with limited diversity, reducing robustness and generalizability in downstream classification tasks. To address this gap, we propose a novel LLM-based oversampling method designed to enhance diversity. First, we introduce a sampling strategy that conditions synthetic sample generation on both minority labels and features. Second, we develop a new permutation strategy for fine-tuning pre-trained LLMs. Third, we fine-tune the LLM not only on minority samples but also on interpolated samples to further enrich variability. Extensive experiments on 10 tabular datasets demonstrate that our method significantly outperforms eight SOTA baselines. The generated synthetic samples are both realistic and diverse. Moreover, we provide theoretical analysis through an entropy-based perspective, proving that our method encourages diversity in the generated samples.
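Fine-tuning details aside, the permutation idea resembles serializing each tabular row as text with a freshly shuffled feature order, so the LLM cannot overfit to a fixed column ordering; interpolated minority rows are serialized the same way. A hypothetical sketch of that preprocessing (the template, label phrasing, and interpolation rule here are illustrative, not the paper's):

```python
import random

def serialize_row(row, label, feature_names):
    """Render one tabular row as text with a freshly permuted feature order,
    so a fine-tuned LLM does not latch onto a fixed column ordering."""
    order = random.sample(range(len(feature_names)), len(feature_names))
    parts = [f"{feature_names[i]} is {row[i]}" for i in order]
    return f"label is {label}, " + ", ".join(parts)

def interpolate_rows(a, b, alpha=0.5):
    """Convex interpolation of two numeric minority rows; interpolated rows
    are also serialized and added to the fine-tuning corpus."""
    return [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
```

Re-serializing each epoch yields a different permutation per row, which is one plausible way the method injects the diversity the abstract emphasizes.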


Appendix Outline

Neural Information Processing Systems

In Appendix A.8, we examine the relationship between the training and test distributions [...]. In Appendix A.9, we present additional and extended experimental results that compare the [...]. In Appendix A.10, we provide additional experimental results that illustrate how the training [...]. In Appendix A.11, we include decision boundary visualizations for imbalanced training. In Appendix C, we discuss the limitations of our study. Lastly, in Appendix D, we discuss the broader impact of our work.

[Figure 7: Optimal augmentations depend on the imbalance ratio. The normalization allows examining the relative effect of batch size.]


Uncertainty-Aware Generative Oversampling Using an Entropy-Guided Conditional Variational Autoencoder

Zare, Amirhossein, Zare, Amirhessam, Pezeshki, Parmida Sadat, Rahimi, Herlock, Ebrahimi, Ali, Vázquez-García, Ignacio, Celi, Leo Anthony

arXiv.org Artificial Intelligence

Class imbalance remains a major challenge in machine learning, especially for high-dimensional biomedical data where nonlinear manifold structures dominate. Traditional oversampling methods such as SMOTE rely on local linear interpolation, often producing implausible synthetic samples. Deep generative models like Conditional Variational Autoencoders (CVAEs) better capture nonlinear distributions, but standard variants treat all minority samples equally, neglecting the importance of uncertain, boundary-region examples emphasized by heuristic methods like Borderline-SMOTE and ADASYN. We propose Local Entropy-Guided Oversampling with a CVAE (LEO-CVAE), a generative oversampling framework that explicitly incorporates local uncertainty into both representation learning and data generation. To quantify uncertainty, we compute Shannon entropy over the class distribution in a sample's neighborhood: high entropy indicates greater class overlap, serving as a proxy for uncertainty. LEO-CVAE leverages this signal through two mechanisms: (i) a Local Entropy-Weighted Loss (LEWL) that emphasizes robust learning in uncertain regions, and (ii) an entropy-guided sampling strategy that concentrates generation in these informative, class-overlapping areas. Applied to clinical genomics datasets (ADNI and TCGA lung cancer), LEO-CVAE consistently improves classifier performance, outperforming both traditional oversampling and generative baselines. These results highlight the value of uncertainty-aware generative oversampling for imbalanced learning in domains governed by complex nonlinear structures, such as omics data.
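The neighborhood entropy that drives both LEWL and the sampling strategy is straightforward to compute with a k-NN query. A minimal sketch, assuming integer class labels and Euclidean neighborhoods (k and the metric are choices the paper may make differently):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_entropy(X, y, k=10):
    """Shannon entropy of the class distribution in each sample's k-NN
    neighborhood; high entropy marks class-overlapping (uncertain) regions."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    ent = np.empty(len(X))
    for i, neigh in enumerate(idx[:, 1:]):          # drop the point itself
        counts = np.bincount(y[neigh], minlength=y.max() + 1)
        p = counts[counts > 0] / k
        ent[i] = -(p * np.log(p)).sum()
    return ent
```

The resulting per-sample entropies could then weight the CVAE reconstruction loss (the LEWL term) and define the sampling distribution that concentrates generation in high-entropy regions.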


Adaptive kernel-density approach for imbalanced binary classification

Nishimura, Kotaro J., Sakumura, Yuichi, Ikeda, Kazushi

arXiv.org Artificial Intelligence

Class imbalance is a common challenge in real-world binary classification tasks, often leading to predictions biased toward the majority class and reduced recognition of the minority class. This issue is particularly critical in domains such as medical diagnosis and anomaly detection, where correct classification of minority classes is essential. Conventional methods often fail to deliver satisfactory performance when the imbalance ratio is extremely severe. To address this challenge, we propose a novel approach called Kernel-density-Oriented Threshold Adjustment with Regional Optimization (KOTARO), which extends the framework of kernel density estimation (KDE) by adaptively adjusting decision boundaries according to local sample density. In KOTARO, the bandwidth of Gaussian basis functions is dynamically tuned based on the estimated density around each sample, thereby enhancing the classifier's ability to capture minority regions. Experimental results demonstrated that KOTARO outperformed conventional methods, particularly under conditions of severe imbalance, highlighting its potential as a promising solution for a wide range of imbalanced classification problems.
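The core mechanism, a per-sample Gaussian bandwidth that widens in sparse regions, can be sketched with a k-th-neighbor bandwidth rule. This illustrates the adaptive-bandwidth idea only; KOTARO's actual bandwidth tuning and regional threshold adjustment are more involved:

```python
import numpy as np
from scipy.spatial.distance import cdist

def adaptive_kde_scores(X_train, y_train, X_test, k=5):
    """Per-class kernel density with a per-sample Gaussian bandwidth set from
    the distance to the k-th same-class neighbor, so sparse (often minority)
    regions get wider kernels."""
    dim = X_train.shape[1]
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        d_cc = cdist(Xc, Xc)
        h = np.sort(d_cc, axis=1)[:, k] + 1e-12     # k-th neighbor distance per sample
        d = cdist(X_test, Xc)
        # Gaussian kernels with per-center bandwidth h; the shared (2*pi)^(d/2)
        # constant is dropped since it does not affect class comparison.
        scores[c] = np.mean(np.exp(-0.5 * (d / h) ** 2) / h ** dim, axis=1)
    return scores  # predict via argmax, or threshold the minority/majority ratio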


GK-SMOTE: A Hyperparameter-free Noise-Resilient Gaussian KDE-Based Oversampling Approach

Miraj, Mahabubur Rahman, Huang, Hongyu, Yang, Ting, Zhao, Jinxue, Mu, Nankun, Lei, Xinyu

arXiv.org Artificial Intelligence

Imbalanced classification is a significant challenge in machine learning, especially in critical applications like medical diagnosis, fraud detection, and cybersecurity. Traditional oversampling techniques, such as SMOTE, often fail to handle label noise and complex data distributions, leading to reduced classification accuracy. In this paper, we propose GK-SMOTE, a hyperparameter-free, noise-resilient extension of SMOTE, built on Gaussian Kernel Density Estimation (KDE). GK-SMOTE enhances class separability by generating synthetic samples in high-density minority regions, while effectively avoiding noisy or ambiguous areas. This self-adaptive approach uses Gaussian KDE to differentiate between safe and noisy regions, ensuring more accurate sample generation without requiring extensive parameter tuning. Our extensive experiments on diverse binary classification datasets demonstrate that GK-SMOTE outperforms existing state-of-the-art oversampling techniques across key evaluation metrics, including MCC, Balanced Accuracy, and AUPRC. The proposed method offers a robust, efficient solution for imbalanced classification tasks, especially in noisy data environments, making it an attractive choice for real-world applications.
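The density-based safe/noisy split can be illustrated with SciPy's Gaussian KDE: score minority points, discard the low-density tail as noisy, and sample around the remaining centers using the KDE's own kernel covariance. A rough sketch (the median cutoff and Scott's-rule bandwidth are placeholders for the paper's self-adaptive choices):

```python
import numpy as np
from scipy.stats import gaussian_kde

def gk_oversample(X_min, n_new, rng=None):
    """Generate n_new synthetic minority samples in high-density regions.
    Assumes more minority samples than features, as gaussian_kde requires."""
    rng = np.random.default_rng(rng)
    kde = gaussian_kde(X_min.T)                      # expects shape (d, n)
    dens = kde(X_min.T)
    safe = X_min[dens >= np.median(dens)]            # drop low-density / noisy points
    centers = safe[rng.integers(len(safe), size=n_new)]
    cov = kde.covariance                             # reuse the KDE kernel covariance
    noise = rng.multivariate_normal(np.zeros(X_min.shape[1]), cov, size=n_new)
    return centers + noise
```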


Learning Majority-to-Minority Transformations with MMD and Triplet Loss for Imbalanced Classification

Cha, Suman, Kim, Hyunjoong

arXiv.org Machine Learning

Traditional oversampling techniques, including SMOTE and its variants, generate synthetic minority samples via local interpolation but fail to capture global data distributions in high-dimensional spaces. Deep generative models based on GANs offer richer distribution modeling yet suffer from training instability and mode collapse under severe imbalance. To overcome these limitations, we introduce an oversampling framework that learns a parametric transformation to map majority samples into the minority distribution. Our approach minimizes the maximum mean discrepancy (MMD) between transformed and true minority samples for global alignment, and incorporates a triplet loss regularizer to enforce boundary awareness by guiding synthesized samples toward challenging borderline regions. We evaluate our method on 29 synthetic and real-world datasets, demonstrating consistent improvements over classical and generative baselines in AUROC, G-mean, F1-score, and MCC. These results confirm the robustness, computational efficiency, and practical utility of the proposed framework for imbalanced classification tasks.
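The objective combines a kernel MMD term for global alignment with a triplet term for boundary awareness. A compact PyTorch sketch, assuming an RBF kernel with a fixed bandwidth and externally supplied anchor/positive/negative triplets (the paper's kernel choice, triplet mining, and weighting may differ):

```python
import torch

def rbf_mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y under an RBF kernel."""
    def k(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def loss_step(T, x_maj, x_min, anchor, pos, neg, lam=0.1, margin=1.0):
    """Hypothetical training step: T maps majority samples toward the minority
    distribution; the triplet term pulls anchors toward borderline minority
    positives and away from majority negatives."""
    z = T(x_maj)
    triplet = torch.nn.functional.triplet_margin_loss(anchor, pos, neg, margin=margin)
    return rbf_mmd2(z, x_min) + lam * triplet
```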